On-line Approximate String Matching in Natural Language

Author

  • Kimmo Fredriksson
Abstract

We consider approximate pattern matching in natural language text. We use the words of the text as the alphabet, instead of the characters as in traditional string matching approaches. Hence our pattern consists of a sequence of words. From the algorithmic point of view this has several advantages: (i) the number of words is much smaller than the number of characters, which in effect means a shorter text (fewer possible matching positions); (ii) the pattern effectively becomes shorter, so bit-parallel techniques become more applicable; (iii) the alphabet size becomes much larger, so the probability that two symbols (in this case, words) match is reduced. We extend several known approximate string matching algorithms to this scenario, allowing k insertions, deletions or substitutions of symbols (natural language words). We further extend the algorithms to allow k′ errors inside the pattern symbols (words) as well. The two error thresholds k and k′ can be applied simultaneously and independently. Hence we have in effect two alphabets, and perform approximate matching at both levels. From the application point of view the advantage is that the method is flexible, allowing simple solutions to problems that are hard to solve with traditional approaches. Finally, we extend the algorithms to handle multiple patterns at the same time. Depending on the search parameters, we obtain algorithms that run in linear or sublinear time and that perform the optimal number of word comparisons on average. We conclude with experimental results showing that the methods work well in practice.
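
The two-level matching described in the abstract can be illustrated with a minimal sketch. It is not the paper's bit-parallel or sublinear algorithms; it assumes a plain dynamic-programming (Sellers-style) scan over words, with a character-level edit distance deciding whether two words count as equal. The function names `edit_distance`, `words_match` and `search`, the parameter name `k_inner` (standing in for k′), and whitespace tokenization are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (character level)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def words_match(p, t, k_inner):
    """Two words count as the same symbol if they differ by at most k_inner edits."""
    return edit_distance(p, t) <= k_inner


def search(pattern, text, k, k_inner):
    """Return 0-based indices of text words at which an occurrence of the
    pattern ends with at most k word-level insertions, deletions or
    substitutions. Assumes words are separated by whitespace."""
    pat = pattern.split()
    m = len(pat)
    # col[i] = fewest word errors aligning pat[:i] with some suffix of the
    # text read so far; col[0] stays 0 so a match may start anywhere.
    col = list(range(m + 1))
    ends = []
    for pos, word in enumerate(text.split()):
        prev_diag = col[0]
        for i in range(1, m + 1):
            cost = 0 if words_match(pat[i - 1], word, k_inner) else 1
            new = min(col[i] + 1,        # extra word in the text
                      col[i - 1] + 1,    # pattern word missing from the text
                      prev_diag + cost)  # word match or word substitution
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(pos)
    return ends


text = "the quick brown fox jumps over the lazy dog"
print(search("quick brwn fox", text, k=0, k_inner=1))  # [3]
print(search("quick brwn fox", text, k=0, k_inner=0))  # []
print(search("quick brwn fox", text, k=1, k_inner=0))  # [3]
```

In this sketch the misspelled word "brwn" is absorbed either by the inner threshold (one character edit inside a word) or by the outer threshold (one substituted word), which shows how the two thresholds act independently.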

Similar resources

A Novel Fingerprint Algorithm Based on Line-Segment Chain

A novel representation of fingerprint image based on Line-Segment features which are extracted from a bank of Gabor-filtered fingerprint images is proposed in this paper. In the feature matching stage, quadrangles constructed with Perpendicularity Line-Segments are used to obtain the alignment parameters. Compared with the existing representations of fingerprint, line-Segment based representati...

Faster Bit-Parallel Approximate String Matching

We present a new bit-parallel technique for approximate string matching. We build on two previous techniques. The first one [Myers, J. of the ACM, 1999], searches for a pattern of length m in a text of length n permitting k differences in O(mn/w) time, where w is the width of the computer word. The second one [Navarro and Raffinot, ACM JEA, 2000], extends a sublinear-time exact algorithm to approximat...

Assisting bug Triage in Large Open Source Projects Using Approximate String Matching

In this paper, we propose a novel approach for assisting human bug triagers in large open source software projects by semi-automating the bug assignment process. Our approach employs a simple and efficient n-gram-based algorithm for approximate string matching on the character level. We propose and implement a recommender prototype which collects the natural language textual information availab...

An Approximate String Matching Algorithm Based upon the Candidate Elimination Method

In this paper, we consider the approximate string matching problem. We give a method to eliminate candidate locations in text T, as there can be no substring S starting from those locations such that the edit distance between S and pattern P is smaller than or equal to a specified error bound k. Our method is simple to implement. Experimental results show that our method is effective, especiall...

A Fast Heuristic for Approximate String Matching

We study a fast algorithm for on-line approximate string matching. It is based on a non-deterministic finite automaton, which is simulated using bit-parallelism. If the automaton does not fit in a computer word, we partition the problem into subproblems. We show experimentally that this algorithm is the fastest for typical text search. We also show which algorithms are the best in other cases, and ...

Journal:
  • Fundam. Inform.

Volume 72  Issue 

Pages  -

Publication date 2006